
feat: Support Spark expression: percentile_cont #3757

Open
YutaLin wants to merge 2 commits into apache:main from YutaLin:3190-support_percentile_cont

Conversation


@YutaLin YutaLin commented Mar 21, 2026

Which issue does this PR close?

Comet does not currently support the Spark percentile_cont function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

PercentileCont calculates a percentile value based on a continuous distribution of numeric or ANSI interval columns at a given percentage. It implements the SQL PERCENTILE_CONT function which uses linear interpolation between values when the exact percentile position falls between two data points. This expression is a runtime-replaceable aggregate that delegates to the internal Percentile implementation.
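As a rough illustration of the interpolation described above (a standalone sketch under simplified assumptions, not Comet's or DataFusion's actual implementation), assuming a non-empty, already sorted column of doubles:

```rust
// Hypothetical sketch of PERCENTILE_CONT's linear interpolation.
// `sorted` must be non-empty and sorted ascending; `p` is in [0, 1].
fn percentile_cont(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=1.0).contains(&p));
    let pos = p * (sorted.len() - 1) as f64; // fractional row position
    let lower = pos.floor() as usize;
    let upper = pos.ceil() as usize;
    let frac = pos - lower as f64;
    // Linear interpolation between the two neighbouring values; when
    // `pos` lands exactly on a row, lower == upper and frac == 0.
    sorted[lower] + frac * (sorted[upper] - sorted[lower])
}
```

For example, `percentile_cont(&[0.0, 10.0, 20.0, 30.0], 0.25)` lands at position 0.75 and interpolates to 7.5, while a percentage of 1.0 returns the last value exactly.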

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.

Array percentiles and weighted percentiles are not included for now.
Closes #3190

What changes are included in this PR?

Add a PercentileCont message to expr.proto
Add CometPercentile with input validations
Register the Percentile class in QueryPlanSerde
Handle the PercentileCont protobuf message in planner.rs
Add a custom percentile.rs with a Binary aggregation state, because DataFusion's percentile_cont stores all values as a List, and shuffling that state fails with a "Cannot cast list to non-list data types" error

How are these changes tested?

Add SQL tests covering numeric and interval types.

if (expr.percentageExpression.dataType.isInstanceOf[ArrayType]) {
withInfo(aggExpr, "array of percentiles not supported")
return None
}
nit: would it be a better idea to test for a scalar and reject all other inputs, to make the code more defensive?
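For illustration, the reviewer's defensive allow-list could look roughly like this sketch. The types and function name here are simplified stand-ins, not Catalyst's or Comet's actual API:

```rust
// Simplified stand-in for Spark's DataType hierarchy.
#[derive(Debug, PartialEq)]
enum DType {
    Double,
    Array(Box<DType>),
}

// Defensive allow-list: only a foldable (literal) scalar DOUBLE percentage
// is accepted; arrays and non-literal expressions are rejected up front,
// instead of special-casing ArrayType and letting everything else through.
fn is_supported_percentage(data_type: &DType, foldable: bool) -> bool {
    matches!(data_type, DType::Double) && foldable
}
```

The design difference is deny-by-default: any input shape not explicitly anticipated falls back to Spark rather than reaching the native planner.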

val childExpr = exprToProto(expr.child, inputs, binding)
val percentileExpr = exprToProto(expr.percentageExpression, inputs, binding)
val dataType = serializeDataType(expr.dataType)


nit: Is there any reason this if condition wouldn't always be true?

import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Average, BitAndAgg, BitOrAgg, BitXorAgg, BloomFilterAggregate, CentralMomentAgg, Corr, Count, Covariance, CovPopulation, CovSample, First, Last, Max, Min, Percentile, StddevPop, StddevSamp, Sum, VariancePop, VarianceSamp}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.{ByteType, DataTypes, DecimalType, IntegerType, LongType, ShortType, StringType}
import org.apache.spark.sql.types.{ArrayType, ByteType, DataTypes, DayTimeIntervalType, DecimalType, IntegerType, LongType, NumericType, , StringType, YearMonthIntervalType}

Suggested change
import org.apache.spark.sql.types.{ArrayType, ByteType, DataTypes, DayTimeIntervalType, DecimalType, IntegerType, LongType, NumericType, , StringType, YearMonthIntervalType}
import org.apache.spark.sql.types.{ArrayType, ByteType, DataTypes, DayTimeIntervalType, DecimalType, IntegerType, LongType, NumericType, StringType, YearMonthIntervalType}

// Support numeric types and interval types
expr.child.dataType match {
case _: NumericType =>
case _: DecimalType =>

The match on DecimalType is unreachable (because NumericType already handles it)

Suggested change
case _: DecimalType =>
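Since Catalyst's DecimalType extends NumericType, the first arm already matches every decimal value. A minimal Rust analogue of this shadowing (illustrative only; Rust has no subtyping, so a broad earlier arm plays the role of the supertype case):

```rust
// Stand-in for a few Spark types.
enum SparkType {
    Int,
    Decimal,
    String,
}

fn classify(t: &SparkType) -> &'static str {
    match t {
        // Broad arm covering both Int and Decimal, like `case _: NumericType`
        // in the Scala match above.
        SparkType::Int | SparkType::Decimal => "numeric",
        // A dedicated `SparkType::Decimal => "decimal"` arm placed here would
        // be flagged by rustc as an unreachable pattern, mirroring the
        // unreachable `case _: DecimalType` in the Scala code.
        SparkType::String => "other",
    }
}
```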

-- Uses similar test data as Spark's percentiles.sql

statement
CREATE TABLE test_percentile(k int, v int) USING parquet

Thanks for adding the SQL tests! It would be great to add tests with negative values, boundary percentiles (0.0 and 1.0), all-null groups, single-value groups, and DOUBLE/FLOAT column types. The negative-values test is especially important, since sort order matters for correctness and there may currently be bugs in the implementation; it would be good to start by adding tests.
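A small reference implementation (hypothetical, not part of this PR) can generate expected values for the suggested cases; the ascending sort is exactly what the negative-values test exercises:

```rust
// Reference PERCENTILE_CONT for producing expected test values.
// Unlike the sketch of the interpolation itself, this sorts its input,
// so it can be fed unsorted columns containing negative values.
fn ref_percentile_cont(values: &[f64], p: f64) -> f64 {
    assert!(!values.is_empty() && (0.0..=1.0).contains(&p));
    let mut sorted = values.to_vec();
    // Ascending sort: with negatives present, skipping or reversing this
    // step silently produces wrong percentiles.
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos as usize;
    let frac = pos - lo as f64;
    if lo + 1 < sorted.len() {
        sorted[lo] + frac * (sorted[lo + 1] - sorted[lo])
    } else {
        sorted[lo] // p == 1.0 boundary
    }
}
```

For instance, the median of `[-5.0, 3.0, -1.0]` is -1.0 only after sorting; an unsorted pass would pick 3.0.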
